1. Introduction


Reference crop evapotranspiration (ETo) is a key variable in irrigation planning, water resources
management, and hydrological studies [1]. Evapotranspiration is a complex, nonlinear
phenomenon [2], so robust nonlinear methods are needed to model it. Data-mining methods are
well suited to this task and have been used in many studies to solve complex nonlinear
problems. Their applications include river flow modeling [3], reservoir operation [4],
minimizing irrigation deficiencies [5], optimization of energy management [6], precipitation
modeling [7,8], modeling water quality parameters [9], flood frequency analysis under climate
change [10], estimating pier scour depth [11], and seismic retrofit cost estimation [12].


So far, many methods based on the meteorological parameters available in different geographical
and climatic conditions have been proposed to determine ETo. Traore et al. [13] estimated ETo
using an artificial neural network (ANN) in Burkina Faso. The results indicated that ANN is
highly capable of evaluating ETo. Rahimikhoob et al. [14] compared the M5 decision tree model
and ANN for estimating ETo in a dry climate. ANN estimated ETo better than the M5 decision tree
model, but both models calculated ETo with reasonable accuracy, and their results were close to
those of the FAO 56 Penman-Monteith (PM) equation. Yassin et al. [15] estimated ETo using ANN
and gene expression programming (GEP) in dry climates. The eight ETo models produced with the
ANN technique were slightly more accurate than those produced with the GEP technique. Caminha et
al. [16] estimated ETo using data-mining predictor models and feature selection. Highly accurate
models could be produced by using the M5 tree algorithm and a feature selection technique.
Mehdizadeh [17] estimated daily ETo using artificial intelligence. The local performance of the
models showed that the MARS and GEP approaches could determine daily ETo using meteorological
parameters and residual ETo data as inputs, and MARS had the best performance in the
meteorological-data scenarios. Ferreira et al. [1] modeled daily ETo with limited climatic data
using the MARS algorithm and the FAO 56 PM equation. The MARS model showed superior performance
in all scenarios. Models that used solar radiation had the best performance, followed by those
that used relative humidity and wind speed. Ehteram et al. [18] employed a hybrid of support
vector regression (SVR) and the cuckoo search (CS) algorithm, M5, GEP, and an adaptive
neuro-fuzzy inference system (ANFIS) for modeling ETo in India. The SVR-CS hybrid modeled ETo
more accurately than the other investigated algorithms. Wang et al. [19] examined generalized
evapotranspiration models with limited data based on GEP and RF in Guangxi, China. RF-based ETo
models performed slightly better than GEP-based models. Fan et al. [20] estimated daily ETo with
local and external meteorological data using M5, RF, LightGBM, and the empirical equations of
Makkink, Tabari, Hargreaves-Samani, and Trabert in humid areas. All three soft computing models
produced better daily ETo estimates than the corresponding empirical models using the same input
variables. Ferreira and Cunha [21] explored a new method for estimating daily ETo based on
hourly temperature and relative humidity using ANN, RF, and CNN models. The developed CNN models
offered the best performance in all cases. Granata et al. [22] developed
artificial-intelligence-based approaches to estimate actual evapotranspiration in lagoons. The
RF and K nearest neighbors (KNN) models performed better than the additive regression of
decision stump (ARDS) algorithm and MLP models. Yamaç and Todorovic [23] employed three
data-mining methods, ANN, KNN, and AdaBoost, for modeling ETo. The results indicated better
accuracy of ANN and KNN. Ashrafzadeh et al. [24] used SARIMA, SVM, and GMDH for modeling
long-term ETo in northern Iran. SARIMA outperformed SVM and GMDH. Zhang et al. [25] employed
four different ANN methods for estimating ETo in Henan province, China. The ANN methods
successfully estimated ETo in Henan province. Niaghi et al. [26] used four data-mining methods,
GEP, MLR, RF, and SVR, for modeling ETo and reported good accuracy for these methods. Feng and
Tian [27] modeled ETo using the KNN method and reported good precision.


To the best of the authors' knowledge, different data-mining methods have been used for
modeling reference crop evapotranspiration (ETo). However, these studies have not considered
critical issues such as the impact of climate on the performance of data-mining methods,
uncertainty, and computation time. Therefore, in the present study, different data-mining
methods including ANN, M5 decision tree, LS-SVM, MARS, and RF are employed for modeling ETo by
considering the impact of climate, uncertainty, computation time, and accuracy. Uncertainty is
considered by evaluating the coefficient of variation of the evaluation criteria for each
algorithm over several random runs. To consider the impact of climate on the performance of
data-mining methods, different meteorological stations in two climates are used for modeling
ETo. Finally, the best data-mining method for each climate is presented based on accuracy,
uncertainty, and computation time.
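
The uncertainty measure described above can be illustrated with a short sketch. It assumes that the coefficient of variation is the sample standard deviation divided by the mean of a criterion over repeated runs; the function name and the RMSE values are hypothetical and are not results of this study.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean of a criterion over repeated random runs."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean

# Hypothetical RMSE values (mm/day) from five random train/test runs.
rmse_runs = [0.34, 0.31, 0.36, 0.33, 0.35]
cv_rmse = coefficient_of_variation(rmse_runs)
print(f"CV of RMSE: {cv_rmse:.4f}")
```

A smaller coefficient of variation means that the criterion changes less from one random run to the next.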


The rest of the present study is as follows: Section 2 presents the methodology of the present 
study, including introducing the study area, data used, investigated methods, evaluation criteria, 
limitation of the present study, and model ranking. Section 3 presents the results of sensitivity 
analysis and outcomes of empirical equations and data-mining methods. Section 4 offers the 
discussion about the obtained results. Section 5 presents the conclusion and novelty of the 
present study. Figure 1 shows the workflow of the present study. 


2. Methods
2.1. Study area


In this study, two provinces in the north of Iran, Mazandaran and Semnan, were considered to
calculate ETo. Mazandaran province, with an area of about 24,000 km², lies between 35° 46' and
36° 58' north latitude and between 50° 21' and 54° 08' east longitude. The natural conditions of
Mazandaran province comprise two main areas: the Alborz mountains and the coastal plains.
Semnan province covers 5.8% of Iran, with an area of 97,491 km². It lies between 34° 13' and
37° 20' north latitude and between 51° 51' and 57° 3' east longitude. Its border provinces are
Mazandaran and Golestan in the north, Isfahan in the south, Khorasan in the east, and Tehran in
the west. Fig. 2 shows the geographical location of the two provinces of Mazandaran and
Semnan. Table 1 shows the synoptic stations of Mazandaran and Semnan provinces.


Fig. 1. The workflow of the present study for modeling ETo.


Table 1
Synoptic stations of Semnan and Mazandaran provinces.
Province   | Station     | Altitude (m amsl) | Longitude (°E) | Latitude (°N)
Semnan     | Semnan      | 1130 | 53.32 | 35.34
Semnan     | Shahrood    | 1380 | 54.57 | 36.25
Semnan     | Garmsar     | 850  | 52.25 | 35.20
Semnan     | Damghan     | 1170 | 54.61 | 35.44
Mazandaran | Sari        | 23   | 53.00 | 36.33
Mazandaran | Dasht-e-naz | 12   | 53.11 | 36.37
Mazandaran | Ghaemshahr  | 15   | 52.46 | 36.27
Mazandaran | Babolsar    | -21  | 52.39 | 36.43


Fig. 2. The geographical location of Mazandaran and Semnan provinces.


2.2. Data used 


In the present study, the following parameters are used as inputs for modeling and estimating
reference crop evapotranspiration (ETo): minimum absolute temperature (Tmin-abs, °C), minimum
temperature (Tmin, °C), maximum absolute temperature (Tmax-abs, °C), maximum temperature
(Tmax, °C), mean temperature (Tmean, °C), minimum relative humidity (Hmin, %), maximum relative
humidity (Hmax, %), mean relative humidity (Hmean, %), wind direction (W-d, deg), wind speed
(W-s, m/s), and sunshine hours. The statistical criteria and number of samples of the target
data at the different stations are presented in Table 2. The data were provided by the Water
Resources Management Company, Tehran, Iran.


Table 2
Statistical criteria of inputs and target data in different stations.
Station     | Mean (mm/day) | Min (mm/day) | Max (mm/day) | Std dev (mm/day) | Number of samples
Semnan      | 3.89 | 0.85 | 7.90 | 2.18 | 192
Shahrood    | 3.86 | 0.77 | 8.23 | 2.24 | 276
Garmsar     | 3.44 | 0.76 | 7.18 | 1.92 | 288
Damghan     | 3.17 | 0.67 | 6.89 | 1.91 | 276
Sari        | 2.54 | 0.71 | 5.28 | 1.37 | 204
Dasht-e-naz | 2.65 | 0.85 | 5.52 | 1.39 | 144
Ghaemshahr  | 2.48 | 0.71 | 4.96 | 1.31 | 192
Babolsar    | 2.48 | 0.67 | 5.00 | 1.28 | 204


Many researchers have modeled various phenomena using data-mining methods [7,28,29]. In this
study, ETo is modeled using intelligent and empirical methods. For this purpose, 70% of the data
are used for the training period and 30% for the testing period. The training and testing
samples are selected randomly (random calibration) [8,23,24].
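
The random 70/30 division can be sketched as follows. This is a minimal illustration that assumes a simple shuffle of the sample indices; the function name and the seed are hypothetical, and the sample count is taken from the Semnan row of Table 2.

```python
import random

def random_split(samples, train_fraction=0.7, seed=None):
    """Shuffle the samples and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 192 samples, as listed for Semnan station in Table 2.
indices = range(192)
train, test = random_split(indices, train_fraction=0.7, seed=1)
print(len(train), len(test))
```

Repeating the split with different seeds gives the several random runs over which the coefficient of variation of each criterion can be computed.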

Normalizing the data before they enter a model is one of the essential steps in using
data-mining methods. When the range of the data is wide, normalization significantly helps the
model to train better and faster, which increases the accuracy and speed of the network. The
data are normalized as follows [30]:

$$ X_n = \frac{X_i - X_{min}}{X_{max} - X_{min}} \qquad (1) $$

where $X_n$ is the normalized value of input $X_i$, and $X_{max}$ and $X_{min}$ are the maximum
and minimum data values, respectively.
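
As a minimal sketch of Eq. (1), the function below rescales a list of values to the interval [0, 1]. The function name and the temperature values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min), as in Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; normalization is undefined")
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical mean monthly temperatures (°C).
tmean = [4.0, 10.0, 16.0, 28.0]
normalized = min_max_normalize(tmean)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```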


2.3. Artificial neural network (ANN) model 


The ANN consists of three kinds of layers: an input layer, an output layer, and hidden layers
between them. The ANN may be expressed as a network of interconnected neurons [30]. The basic
unit of the ANN is a neuron or node. Neurons are connected by synapses, each of which has a
weight factor. Artificial neural networks are nonlinear models that use a structure linking the
inputs and outputs of a system to represent complex nonlinear processes. The structure of each
ANN is expressed as (i, j, k), where i represents the number of nodes in the input layer, j the
number of nodes in the hidden layer, and k the number of nodes in the output layer [31]. The
target value in the ANN is calculated as follows:

$$ y(x) = \sum_{i=1}^{n} \alpha_i f\left( \sum_{j=1}^{g} w_{ij} x_j + \beta_i \right) \qquad (2) $$

where $\alpha_i$ and $w_{ij}$ are the weights of the network, $\beta_i$ is the bias of the
network, $f$ is a transfer function, $x_j$ is the j-th input, n denotes the number of neurons in
the hidden layer, and g is the number of inputs.

In this study, one hidden layer with five neurons is used, similar to the study of [32]. For
more information about ANN, please see [11].
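
The forward pass of Eq. (2) can be sketched for a network with one hidden layer of five neurons. The hyperbolic tangent transfer function, the random weights, and all names are assumptions made for illustration; training of the weights is not shown.

```python
import math
import random

def ann_forward(x, w, beta, alpha, f=math.tanh):
    """Eq. (2): y = sum_i alpha[i] * f(sum_j w[i][j] * x[j] + beta[i])."""
    hidden = [f(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i)
              for w_i, b_i in zip(w, beta)]
    return sum(a_i * h_i for a_i, h_i in zip(alpha, hidden))

rng = random.Random(0)
n_inputs, n_hidden = 11, 5   # 11 meteorological inputs, 5 hidden neurons
# Untrained, randomly initialised weights (hypothetical values).
w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
beta = [rng.uniform(-1, 1) for _ in range(n_hidden)]
alpha = [rng.uniform(-1, 1) for _ in range(n_hidden)]

x = [rng.random() for _ in range(n_inputs)]  # one normalised input vector
y = ann_forward(x, w, beta, alpha)
print(y)
```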
2.4. M5 decision tree model


The M5 decision tree model was introduced by Quinlan in 1992. It is a binary decision tree with
linear regression functions at the terminal nodes (leaves) that relate the input and output
variables [33] (Fig. 3). The M5 tree is one of the most common tree models: the
multidimensional parameter space is subdivided into subspaces, and a linear regression model is
built for each subspace at a leaf [34]. This model handles quantitative data, which makes it
more useful than other tree models [35]. At each node, the standard deviation is used to select
the best feature for splitting the dataset [36]. The M5 model is obtained by using the standard
deviation reduction (SDR), calculated as follows:


$$ SDR = sd(E) - \sum_{i} \frac{|E_i|}{|E|} \, sd(E_i) \qquad (3) $$


In this equation, E is the set of samples that reach the node, Ei is the i-th subset of the
data of the parent node, and sd is the standard deviation. Splitting is repeated until the
proper tree structure is formed. The tree is then pruned back to deal with overfitting [34].
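
The split selection based on Eq. (3) can be sketched for a single input feature. The sketch assumes the population standard deviation and a binary split at the midpoint between neighbouring feature values; the function names and the data are hypothetical, and the leaf regressions and pruning of M5 are not shown.

```python
import statistics

def sdr(parent, subsets):
    """Eq. (3): SDR = sd(E) - sum_i |E_i|/|E| * sd(E_i)."""
    total = len(parent)
    weighted = sum(len(s) / total * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted

def best_split(x, y):
    """Return (threshold, SDR) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, float("-inf"))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal feature values
        left = [t for _, t in pairs[:k]]
        right = [t for _, t in pairs[k:]]
        gain = sdr(y, [left, right])
        if gain > best[1]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (threshold, gain)
    return best

# Hypothetical data: mean temperature (°C) versus ETo (mm/day).
tmean = [5, 8, 12, 20, 25, 30]
eto = [1.0, 1.2, 1.1, 4.0, 4.3, 4.1]
threshold, gain = best_split(tmean, eto)
print(threshold, round(gain, 3))
```

In this toy data set the largest reduction is obtained by separating the three cool months from the three warm months.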


Fig. 3. Schematic structure of the M5 tree model. 


2.5. Multivariate adaptive regression splines (MARS) 


The MARS algorithm is a nonlinear, non-parametric method introduced by Friedman in 1991
(Fig. 4); the structure of the model is not known before the modeling process [37]. The
internal functions of a MARS model are piecewise polynomials known as basis functions (B) or
splines. Knots (k) delimit the intervals of the input features over which each piece applies.
The MARS basis functions are expressed as follows [38]:


$$ B = \begin{cases} (k - x)^q & \text{if } x < k \\ 0 & \text{otherwise} \end{cases} \qquad (4) $$

$$ B = \begin{cases} (x - k)^q & \text{if } x \geq k \\ 0 & \text{otherwise} \end{cases} \qquad (5) $$

where q > 0 is the power that determines the degree of the piecewise polynomial. If q = 1, the
splines are linear. A response Y built from M basis functions is obtained by:

$$ Y = f_M(X) = C_0 + \sum_{m=1}^{M} C_m B_m(X) \qquad (6) $$


where Y is the prediction made by the model, $C_0$ is the constant, $C_m$ is the coefficient of
the m-th basis function obtained by the least-squares method, $B_m(X)$ is the m-th basis
function, which may be obtained by multiplying two or more functions, and M is the number of
terms in the final model. The modeling process in MARS is performed in a forward and a backward
phase. In the forward phase, the important features are selected and basis functions are added
to the model; in the backward phase, unnecessary basis functions are removed to prevent
overfitting and enhance the model's accuracy. They are removed according to the generalized
cross-validation (GCV) criterion:

$$ GCV = \frac{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_M(x_i) \right)^2}{\left[ 1 - \frac{M + p \, (M - 1)/2}{N} \right]^2} \qquad (7) $$

where p is the penalty parameter, N is the number of samples, and $y_i$ is the i-th observed
value. For more information, please see [12,39]. Figure 4 shows the MARS model with q = 1 and
one feature.
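
Equations (4) to (7) can be sketched for one feature and q = 1. The knot position, the coefficients, the penalty value, and the data are hypothetical; a real MARS implementation searches for the knots and fits the coefficients by least squares, which is not shown here.

```python
def hinge_left(x, k, q=1):
    """Eq. (4): (k - x)^q if x < k, otherwise 0."""
    return (k - x) ** q if x < k else 0.0

def hinge_right(x, k, q=1):
    """Eq. (5): (x - k)^q if x >= k, otherwise 0."""
    return (x - k) ** q if x >= k else 0.0

def mars_predict(x, c0, terms):
    """Eq. (6): terms is a list of (coefficient, basis_function) pairs."""
    return c0 + sum(c * basis(x) for c, basis in terms)

def gcv(y_obs, y_pred, n_terms, penalty):
    """Eq. (7): mean squared error divided by a complexity penalty."""
    n = len(y_obs)
    effective = n_terms + penalty * (n_terms - 1) / 2
    if effective >= n:
        raise ValueError("model is too complex for the number of samples")
    mse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n
    return mse / (1 - effective / n) ** 2

# Hypothetical one-feature model with a single knot at k = 15 and q = 1.
terms = [(0.20, lambda x: hinge_right(x, 15.0)),
         (-0.05, lambda x: hinge_left(x, 15.0))]
x_data = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
y_data = [1.4, 1.8, 2.1, 3.0, 3.9, 5.1]
y_hat = [mars_predict(x, 2.0, terms) for x in x_data]
print(y_hat)
print(gcv(y_data, y_hat, n_terms=2, penalty=2))
```

In the backward phase, a term would be dropped whenever removing it lowers the GCV value.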


Fig. 4. Schematic structure of the MARS model.


2.6. Least squares support vector machine (LS-SVM)


Support vector machines, presented by Cortes and Vapnik in 1995 (Fig. 5) [40], are efficient
learning systems based on constrained optimization theory. They employ the inductive principle
of structural risk minimization, which leads to a globally optimal solution. LS-SVM is a
productive tool for tackling nonlinear problems, classification, and function estimation. The
LS-SVM model uses the following regression model [19]:

$$ y(x_i) = w^T \Phi(x_i) + b \qquad (8) $$

In Eq. (8), $\Phi(x_i)$ is a nonlinear mapping of the inputs into a high-dimensional feature
space, and w and b are the weight vector and the bias term, respectively, which are calculated
by minimizing the following objective function:

$$ \min_{w, b, e} J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \qquad (9) $$

with the following restrictions:

$$ y_i = w^T \Phi(x_i) + b + e_i, \quad i = 1, 2, 3, \ldots, N \qquad (10) $$


In Eqs. (9) and (10), $e_i$ is the error of the training data, and $\gamma$ (gamma) is the
penalty parameter. Large gamma values increase the contribution of the error term to the
objective function. Finally, the estimation function of the LS-SVM model is defined as follows:

$$ y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \qquad (11) $$

where $\alpha_i$ are the Lagrange multipliers and $K(x, x_i)$ is the kernel function, defined as
an inner product in the feature space according to the following equation:

$$ K(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2 \sigma^2} \right) \qquad (12) $$

where $\sigma$ (sigma) is the kernel width.


Fig. 5. Schematic structure of the LS-SVM model.


Gamma and sigma are the essential parameters of LS-SVM and strongly influence its efficiency.
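
A compact sketch of LS-SVM regression with the radial basis kernel is given below. It assumes the usual dual form of Eqs. (9) and (10), in which the bias b and the multipliers alpha solve the linear system [[0, 1^T], [1, K + I/gamma]] [b, alpha] = [0, y]; this system is not written out in the text. The small Gaussian-elimination solver, the gamma and sigma values, and the data are illustrative only.

```python
import math

def rbf_kernel(a, b, sigma):
    """Eq. (12): exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def solve(a, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def lssvm_fit(xs, ys, gamma, sigma):
    """Return (b, alpha) from the assumed LS-SVM dual linear system."""
    n = len(xs)
    a = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(xs[i], xs[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularisation on the diagonal
        a.append(row)
    sol = solve(a, [0.0] + list(ys))
    return sol[0], sol[1:]

def lssvm_predict(x, xs, alpha, b, sigma):
    """Eq. (11): y(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, xs)) + b

# Hypothetical normalised input (one feature) and ETo target (mm/day).
xs = [[0.0], [0.25], [0.5], [0.75], [1.0]]
ys = [1.0, 1.8, 2.9, 4.1, 5.0]
gamma, sigma = 100.0, 0.5
b, alpha = lssvm_fit(xs, ys, gamma, sigma)
print(lssvm_predict([0.6], xs, alpha, b, sigma))
```

Scanning a grid of (gamma, sigma) pairs with this kind of routine and recording the testing MSE is one way to obtain a sensitivity surface for the two parameters.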


2.7. Random forest (RF) model 


The random forest algorithm, first introduced by Breiman in 2001 [41], is a powerful and robust
learning algorithm used for classification, regression analysis, and unsupervised learning
[42]. In the RF algorithm, the user defines three parameters: the number of trees, the minimum
size of the terminal nodes (node size), and the number of variables used to build each tree
[43] (Fig. 6). First, K random samples are generated by the bootstrapping method. Then, one
decision tree is fitted to each sample. Finally, the result of RF is the average of the results
of the K trees. The final results in RF are estimated as follows [12,44]:

$$ \hat{y}(x) = \frac{1}{K} \sum_{k=1}^{K} T_k(x) \qquad (13) $$

where $T_k(x)$ is the prediction of the k-th tree.


Fig. 6. Schematic structure of the random forest (RF) algorithm.
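
The bootstrap-and-average procedure can be sketched as follows. To stay short, the sketch replaces full decision trees with one-split regression stumps on a single feature and omits the random selection of variables, so it illustrates bagging rather than a complete random forest; all names and data are hypothetical.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single feature."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, ml, mr)
    if best is None:  # all feature values equal: predict the mean
        m = statistics.mean(ys)
        return lambda x: m
    _, threshold, ml, mr = best
    return lambda x: ml if x < threshold else mr

def fit_forest(xs, ys, n_trees, seed=0):
    """Fit one stump to each of n_trees bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def forest_predict(trees, x):
    """Average the predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical data: mean temperature (°C) versus ETo (mm/day).
tmean = [5, 8, 12, 16, 20, 25, 30]
eto = [1.0, 1.2, 1.6, 2.4, 3.5, 4.3, 5.1]
trees = fit_forest(tmean, eto, n_trees=100)
prediction = forest_predict(trees, 22)
print(prediction)
```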


2.8. Empirical models for evaluating ETo


A common approach for calculating crop ET consists of calculating ETo and multiplying it by a
crop coefficient.


Turc [45] developed an equation for calculating daily potential evapotranspiration as a function
of air temperature, relative humidity, and solar radiation. The form of the Turc equation
depends on the relative humidity of the air. If the relative humidity is greater than 50%, then:

$$ ET_o = 0.31 \frac{T}{T + 15} (R_s + 2.09) \qquad (14) $$

If the relative humidity is less than 50%:

$$ ET_o = 0.31 \frac{T}{T + 15} (R_s + 2.09) \left( 1 + \frac{50 - RH}{70} \right) \qquad (15) $$

where ETo is daily reference crop evapotranspiration (mm/d), Rs is solar radiation (MJ/m².d), T
is mean daily air temperature (°C), and RH is average daily relative humidity (%).
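
The two branches of the Turc method, Eqs. (14) and (15), can be combined in one small function. The function name and the input values are hypothetical.

```python
def turc_eto(t_mean, rs, rh):
    """Turc ETo (mm/d); t_mean in °C, rs in MJ/m2/d, rh in %."""
    eto = 0.31 * t_mean / (t_mean + 15.0) * (rs + 2.09)
    if rh < 50.0:
        # Correction for dry air, applied only below 50% relative humidity.
        eto *= 1.0 + (50.0 - rh) / 70.0
    return eto

humid = turc_eto(t_mean=25.0, rs=20.0, rh=60.0)
dry = turc_eto(t_mean=25.0, rs=20.0, rh=30.0)
print(round(humid, 2), round(dry, 2))
```

With identical temperature and radiation, the drier day gives the larger estimate because of the humidity correction.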


Jensen and Haise [46] developed an equation to predict potential evapotranspiration by
combining the effects of temperature and solar radiation:

$$ ET_o = (0.0252 \, T_m + 0.078) R_s \qquad (16) $$

where ETo is in mm/d, $T_m$ is mean daily temperature (°C), and $R_s$ is the incoming shortwave
solar radiation at the earth's surface (MJ/m².d).


The Hargreaves-Samani model is based on the maximum, minimum, and mean temperatures and on
radiation [47]:

$$ ET_o = 0.0023 \, R_a \, (T_{mean} + 17.8) \sqrt{T_{max} - T_{min}} \qquad (17) $$

where ETo is in mm/d, $T_{max}$, $T_{min}$, and $T_{mean}$ are the maximum, minimum, and mean
daily temperatures, respectively (°C), and $R_a$ is extraterrestrial radiation (mm/d).
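
A minimal sketch of Eq. (17) is shown below. The function name and inputs are hypothetical, and taking the mean temperature as the average of the maximum and minimum when it is not supplied is an assumption of this sketch.

```python
import math

def hargreaves_samani_eto(t_max, t_min, ra, t_mean=None):
    """Eq. (17): ETo (mm/d) from temperatures (°C) and Ra (mm/d)."""
    if t_max < t_min:
        raise ValueError("t_max must not be smaller than t_min")
    if t_mean is None:
        t_mean = (t_max + t_min) / 2.0  # assumed when no mean is supplied
    return 0.0023 * ra * (t_mean + 17.8) * math.sqrt(t_max - t_min)

eto_hs = hargreaves_samani_eto(t_max=32.0, t_min=18.0, ra=16.0)
print(round(eto_hs, 2))
```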


The United Nations Food and Agriculture Organization (FAO) has adopted the Penman-Monteith 
method in its Irrigation and Drainage Paper No. 56. Known as FAO 56 PM, this method is a 
global reference model for calculating reference crop evapotranspiration based on meteorological 
data [48]. It works well in different locations if the required data are available. It even works well 
in regions with limited data. Temperature, relative humidity, wind speed and solar radiation data 
are necessary for the FAO 56 PM method. 


This model is expressed by the following equation [49]:

$$ ET_o = \frac{0.408 \, \Delta (R_n - G) + \gamma \frac{900}{T + 273} U_2 (e_s - e_a)}{\Delta + \gamma (1 + 0.34 \, U_2)} \qquad (18) $$

where ETo is reference crop evapotranspiration (mm/d), T is mean daily air temperature at 2 m
height (°C), $U_2$ is the wind speed at 2 m height (m/s), $R_n$ is net radiation at the crop
surface (MJ/m².d), G is soil heat flux density (MJ/m².d), $e_s - e_a$ is the saturation vapor
pressure deficit (kPa), $\Delta$ is the slope of the saturation vapor pressure-temperature curve
(kPa/°C), and $\gamma$ is the psychrometric constant (kPa/°C).
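
Equation (18) translates directly into a function of its already-computed terms. The function name and the input values are illustrative; deriving the slope, net radiation, and vapor pressures from raw station data is not shown.

```python
def fao56_pm_eto(delta, rn, g, gamma, t_mean, u2, es, ea):
    """Eq. (18): FAO 56 Penman-Monteith reference evapotranspiration (mm/d)."""
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator

# Illustrative inputs for a warm, humid month.
eto_pm = fao56_pm_eto(delta=0.246, rn=14.33, g=0.14, gamma=0.0674,
                      t_mean=30.2, u2=2.0, es=4.42, ea=2.85)
print(round(eto_pm, 2))
```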


In this study, 70% of the data were selected as training data and 30% as testing data.
Temperature, relative humidity, solar radiation, wind speed, and sunshine hours were used as
model inputs, and the ETo calculated by the FAO 56 PM method was used as the target output.

2.9. Evaluation criteria 


The performance of the data-mining approaches is compared based on the coefficient of
determination (R²), mean absolute error (MAE), root mean squared error (RMSE), and mean
squared error (MSE). These criteria are calculated as follows [50-52]:

$$ R^2 = \left[ \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left[ N \sum X_i^2 - \left( \sum X_i \right)^2 \right] \left[ N \sum Y_i^2 - \left( \sum Y_i \right)^2 \right]}} \right]^2 \qquad (19) $$

$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (X_i - Y_i)^2 \qquad (20) $$

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - Y_i)^2} \qquad (21) $$

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} |X_i - Y_i| \qquad (22) $$

In Eqs. (19-22), $X_i$ is the observed value, $Y_i$ is the estimated value, and N is the number
of data.
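
The four criteria can be computed together, as in the sketch below. The function name and the sample values are hypothetical.

```python
import math

def evaluation_criteria(observed, estimated):
    """Return R2, MSE, RMSE and MAE as defined in Eqs. (19)-(22)."""
    n = len(observed)
    sx, sy = sum(observed), sum(estimated)
    sxy = sum(x * y for x, y in zip(observed, estimated))
    sxx = sum(x * x for x in observed)
    syy = sum(y * y for y in estimated)
    r = (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))
    mse = sum((x - y) ** 2 for x, y in zip(observed, estimated)) / n
    mae = sum(abs(x - y) for x, y in zip(observed, estimated)) / n
    return {"R2": r ** 2, "MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}

# Hypothetical ETo values (mm/day): estimates are biased by +0.5 mm/day.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 3.5, 4.5]
criteria = evaluation_criteria(observed, estimated)
print(criteria)
```

In this example R² equals 1 although every estimate is 0.5 mm/day too high, which illustrates why the error criteria are reported alongside the coefficient of determination.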


2.10. Model ranking 


In this study, the models were ranked according to the method presented in [52], considering
both computation time and accuracy. The lower the computation time and the higher the accuracy,
the better the model.


3. Results 


In this study, mean monthly meteorological parameters, including temperature-based variables,
humidity-based variables, sunshine hours, wind direction, and wind speed, are used in the
classical and modern models (ANN, M5, MARS, LS-SVM, and RF). The following sections report the
sensitivity analysis and the results of applying these data-mining methods to estimate ETo at
the selected stations in Mazandaran and Semnan provinces.


3.1. Sensitivity analysis of data 


The sensitivity of ETo to the input variables is analyzed using correlation analysis for Semnan
province (Fig. 7). In this method, the Pearson correlation between each input variable and the
target variable (ETo) is estimated. A positive Pearson correlation means that ETo increases as
the input variable increases, whereas a negative correlation means that ETo decreases as the
input variable increases. According to Fig. 7, ETo increases with the temperature-based
variables, wind direction, wind speed, and sunshine hours, and decreases as the humidity-based
parameters increase.
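
The sign interpretation of the Pearson correlation can be sketched with two inputs. The function name and all values are hypothetical and are not the data of Fig. 7.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical monthly values.
tmean = [5, 10, 15, 22, 28, 30]        # mean temperature (°C)
hmean = [70, 65, 55, 45, 35, 30]       # mean relative humidity (%)
eto = [1.0, 1.6, 2.4, 3.9, 5.2, 5.6]   # ETo (mm/day)

print(round(pearson(tmean, eto), 3))   # positive: ETo rises with temperature
print(round(pearson(hmean, eto), 3))   # negative: ETo falls with humidity
```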
Fig. 7. Sensitivity analysis of input data for Semnan province.


3.2. Sensitivity analysis of machine-learning methods


Figure 8 shows the MSE values of RF in the testing period for different numbers of trees at the
investigated stations. According to Fig. 8, RF gives the best results with 100, 400, 200, and
500 trees in Semnan, Shahrood, Garmsar, and Damghan, respectively. The best number of trees is
100 for Sari and Ghaemshahr and 350 for Dasht-e-naz and Babolsar. According to this figure, the
number of trees that minimizes MSE is specific to each station, probably because of the impact
of climate type on the results of data-mining methods.


Fig. 8. Sensitivity analysis of RF for selecting the number of trees in: a) Semnan, b) Shahrood, c)
Garmsar, d) Damghan, e) Sari, f) Dasht-e-naz, g) Ghaemshahr, and h) Babolsar.


Figure 9 demonstrates the contour plot of MSE for different values of gamma and sigma. The best
values of (gamma, sigma) for Semnan, Shahrood, Garmsar, and Damghan are (8, 3), (10, 4),
(8.5, 2.25), and (6.5, 2.75), respectively. The corresponding values for Sari, Dasht-e-naz,
Ghaemshahr, and Babolsar are (10, 5.5), (10, 6.75), (8.75, 5), and (4.75, 3), respectively.


3.3. Estimation of ETo by data-mining models


3.3.1. Mazandaran province 

ETo was estimated using the M5, MARS, RF, LS-SVM, and ANN models for Sari station. As seen in
Table 3, for all datasets, the MARS model has the highest mean coefficient of determination
(0.9678), with a coefficient of variation of 0.0093. The MARS model also has the lowest MSE and
RMSE (0.1149 and 0.3366, respectively), with coefficients of variation of 0.2397 and 0.1252.
The M5 model has the lowest mean coefficient of determination (0.8548, coefficient of variation
0.0591), the lowest computation time (0.8272 s), and the lowest MAE (0.0478, coefficient of
variation 0.7894). The RF model has the highest error values: its mean MAE, MSE, and RMSE are
0.0881, 0.1817, and 0.4435, with coefficients of variation of 0.4093, 0.3166, and 0.0758,
respectively. This model also recorded a high computation time (1176.482 s).


Table 3
Statistics of data-mining models for Sari station.
Model  | R²     | CV of R² | MAE (mm/day) | CV of MAE | MSE (mm/day)² | CV of MSE | RMSE (mm/day) | CV of RMSE | Time (s)
M5     | 0.8548 | 0.0591 | 0.0478 | 0.7894 | 0.2352 | 0.3275 | 0.4795 | 0.1597 | 0.8272
MARS   | 0.9678 | 0.0093 | 0.0821 | 1.475  | 0.1149 | 0.2397 | 0.3366 | 0.1252 | 1.0377
LS-SVM | 0.9657 | 0.0082 | 0.0318 | 0.1153 | 0.1238 | 0.7991 | 0.3504 | 0.1832 | 37.6515
ANN    | 0.9406 | 0.0232 | 0.0525 | 0.7993 | 0.1989 | 0.1955 | 0.4429 | 0.0964 | 1.3575
RF     | 0.9589 | 0.0068 | 0.0881 | 0.4093 | 0.1817 | 0.3166 | 0.4435 | 0.0758 | 1176.482


For Babolsar station, the LS-SVM model has the highest coefficient of determination (0.9732,
coefficient of variation 0.0072) and the lowest MAE, MSE, and RMSE among the models (0.0382,
0.1071, and 0.3259, respectively), with coefficients of variation of 0.7991, 0.1832, and
0.0943. The M5 model recorded the lowest mean coefficient of determination (0.8599, coefficient
of variation 0.0962) and the lowest computation time (0.7864 s). It also has the highest MAE
(0.0885, coefficient of variation 0.5125). The RF model has the highest MSE and RMSE values
(0.2974 and 0.5357, respectively); the coefficients of variation of its error criteria are
1.0948, 0.2939, and 0.1299. It also recorded a high average computation time of 1027.479 s.


At Dasht-e-Naz station, the LS-SVM model has the highest average coefficient of determination
(0.9665), with a coefficient of variation of 0.0058. The MARS model has the lowest MAE, MSE,
and RMSE (0.0551, 0.1334, and 0.3610, respectively), with coefficients of variation of 0.3069,
0.1624, and 0.5986, respectively. The M5 model has the lowest mean coefficient of determination
(0.8853, coefficient of variation 0.0494) and the lowest computation time (0.6631 s). The RF
model has the highest MSE and RMSE values (0.3003 and 0.5457, with coefficients of variation of
0.867 and 0.962, respectively) and a high average computation time (787.5662 s).


For Ghaemshahr station, the MARS model has the highest average coefficient of determination
(0.9761), with a coefficient of variation of 0.0074. The MARS model also has the lowest MAE,
MSE, and RMSE (0.0551, 0.1334, and 0.361, respectively), with coefficients of variation of
0.5986, 0.1962, and 0.1041, respectively. The M5 model has the lowest mean coefficient of
determination (0.8969, coefficient of variation 0.0525) and the lowest computation time
(0.7579 s). The RF model has the highest MAE, MSE, and RMSE (0.0671, 0.2189, and 0.4659,
respectively), with coefficients of variation of 0.6879, 0.1948, and 0.0984, respectively. It
also recorded a high average computation time (1051.348 s).


According to the results, M5 has the lowest R² at all stations. It also has the lowest
computation time, which can be an advantage of this model when computation time matters.


Given the above results, the RF model has a good and acceptable coefficient of determination, 
but because of its high error rate and the high computation time, it is not recommended in the 
humid climate of Mazandaran province. Rankings of the models [53] are shown in Table 4. 


In terms of computation time, M5 is ranked first with a score of 4, MARS is ranked second with
a score of 8, and the ANN, LSSVM, and RF models follow.


Table 4
Ranking of the smart models in terms of accuracy and computation time.
            | Accuracy                     | Computation time
Station     | M5 | MARS | LSSVM | ANN | RF | M5 | MARS | LSSVM | ANN | RF
Babolsar    | 5  | 2 | 1 | 3  | 4  | 1 | 2 | 4  | 3  | 5
Dasht-e-Naz | 5  | 1 | 2 | 3  | 4  | 1 | 2 | 4  | 3  | 5
Ghaemshahr  | 5  | 1 | 2 | 3  | 4  | 1 | 2 | 4  | 3  | 5
Sari        | 5  | 1 | 2 | 3  | 4  | 1 | 2 | 4  | 3  | 5
Total       | 20 | 5 | 7 | 12 | 16 | 4 | 8 | 16 | 12 | 20


In terms of accuracy, MARS is ranked first with a score of 5, and the LSSVM, ANN, RF, and M5
models rank second to fifth. In terms of both time and accuracy, the MARS model is ranked first
with a score of 13, the LSSVM model is second with a score of 23, the ANN and M5 models share
third place with a score of 24, and the RF model is ranked fourth with a score of 36. Figures
10 to 13 show the computed and observed ETo values for the MARS and M5 models in the Mazandaran
climate.
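
The combined scores quoted above follow from summing the per-station ranks of Table 4, as the short sketch below shows. The variable and function names are hypothetical; the ranks are copied from Table 4.

```python
models = ["M5", "MARS", "LSSVM", "ANN", "RF"]

# Per-station ranks from Table 4 (1 = best), in the order of `models`.
accuracy_rank = {
    "Babolsar":    [5, 2, 1, 3, 4],
    "Dasht-e-Naz": [5, 1, 2, 3, 4],
    "Ghaemshahr":  [5, 1, 2, 3, 4],
    "Sari":        [5, 1, 2, 3, 4],
}
time_rank = {station: [1, 2, 4, 3, 5] for station in accuracy_rank}

def total_scores(rank_table):
    """Sum the ranks of each model over all stations."""
    return [sum(column) for column in zip(*rank_table.values())]

accuracy_total = total_scores(accuracy_rank)
time_total = total_scores(time_rank)
combined = {m: a + t for m, a, t in zip(models, accuracy_total, time_total)}
ranking = sorted(models, key=combined.get)  # lowest total score first

print(combined)
print(ranking)
```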


Finally, to provide a comprehensive model for the humid climate, the data of the four synoptic
stations of Mazandaran province were pooled. The MARS model, with R², MAE, MSE, and RMSE values
of 0.9637, 0.0267, 0.1266, and 0.3558, respectively, was the best model.


3.3.2. Semnan province stations


Table 5 presents the results of the data-mining models for Semnan station and all datasets. The
LS-SVM model recorded the highest coefficient of determination (0.9896, coefficient of
variation 0.002). However, this model also has the highest MAE, MSE, and RMSE values (0.1036,
0.3288, and 0.5706, respectively), with coefficients of variation of 1.3585, 0.2041, and
0.1059. The M5 model recorded the lowest coefficient of determination (0.9453, coefficient of
variation 0.0224) and the lowest computation time (1.024 s). The ANN model has the lowest MAE,
MSE, and RMSE (0.0393, 0.0987, and 0.3125, respectively), with coefficients of variation of
0.9562, 0.2106, and 0.1117, respectively.


Table 5
Statistics of data mining models for Semnan station.
Model  | R²     | CV of R² | MAE (mm/day) | CV of MAE | MSE (mm/day)² | CV of MSE | RMSE (mm/day) | CV of RMSE | Time (s)
M5     | 0.9453 | 0.0224 | 0.043  | 0.6775 | 0.2075 | 0.2266 | 0.4529 | 0.1137 | 1.024
MARS   | 0.9857 | 0.0049 | 0.0407 | 0.6961 | 0.1031 | 0.3179 | 0.3174 | 0.1625 | 1.5812
LS-SVM | 0.9896 | 0.002  | 0.1036 | 1.3585 | 0.3288 | 0.2041 | 0.5706 | 0.1059 | 55.6474
ANN    | 0.9892 | 0.0012 | 0.0393 | 0.9562 | 0.0987 | 0.2106 | 0.3125 | 0.1117 | 1.4096
RF     | 0.982  | 0.0038 | 0.0462 | 0.6243 | 0.13   | 0.1963 | 0.3589 | 0.1013 | 1435.75


For Damghan station, the LS-SVM model has the highest average coefficient of determination
(0.9928, coefficient of variation 0.0016). However, it also has the highest MAE, MSE, and RMSE
(0.1092, 0.3868, and 0.6161, respectively), with coefficients of variation of 0.7157, 0.2756,
and 0.1405, respectively. The M5 tree model recorded the lowest average coefficient of
determination (0.9713, coefficient of variation 0.0087) and the lowest computation time
(0.7864 s). The ANN model has the lowest MAE, MSE, and RMSE values (0.0331, 0.1016, and 0.3151,
respectively), with coefficients of variation of 0.6997, 0.322, and 0.1607.


For Garmsar station, the LS-SVM model has the highest average coefficient of determination
(0.9923), with a coefficient of variation of 0.002. However, it also has the highest MAE, MSE,
and RMSE values (0.0854, 0.3742, and 0.6099, respectively), with coefficients of variation of
0.7615, 0.1642, and 0.0822, respectively. The M5 tree model recorded the lowest mean
coefficient of determination (0.9653, coefficient of variation 0.0043) and the lowest
computation time (0.9459 s). The MARS model has the lowest MAE, MSE, and RMSE values (0.0327,
0.0112, and 0.3198, respectively), with coefficients of variation of 0.6628, 774, and 0.3251,
respectively.


For Shahrood station, the LS-SVM model has the highest average coefficient of determination 
(0.9925), with a coefficient of variation of 0.0022. This model also has the highest MAE, MSE, and 
RMSE values (0.1122, 0.92994, and 0.5451, respectively), with coefficients of variation of 0.4365, 
0.1824, and 0.0918, respectively. The M5 tree model recorded the lowest coefficient of 
determination (0.9678), with a coefficient of variation of 0.0108, and the shortest 
computation time (0.8991 s). The ANN model has the lowest MAE, MSE, and RMSE values 
(0.0328, 0.0756, and 0.2734, respectively), with coefficients of variation of 0.6943, 0.2212, 
and 0.113, respectively. 

The data-mining models (M5 tree, MARS, LS-SVM, ANN, and RF) are also ranked for the four 
stations in Semnan province. The results are presented in Table 6 and Table 7. 


Table 6 
Prioritizing data mining models in Semnan stations in terms of time. 


Station | M5 | MARS | LSSVM | ANN | RF 
Damghan | 1 | 2 | 4 | 3 | 5 
Garmsar | 1 | 3 | 4 | 2 | 5 
Semnan | 1 | 3 | 4 | 2 | 5 
Shahrood | 1 | 2 | 4 | 3 | 5 
Total | 4 | 10 | 16 | 10 | 20 


Table 7 
Prioritizing data mining models in Semnan stations in terms of accuracy. 


Station | M5 | MARS | LSSVM | ANN | RF 
Damghan | 5 | 2 | 3 | 1 | 4 
Garmsar | 5 | 1 | 3 | 2 | 4 
Semnan | 5 | 2 | 3 | 1 | 4 
Shahrood | 4 | 2 | 3 | 1 | 5 
Total | 19 | 7 | 12 | 5 | 17 


In terms of time, according to Table 6, the M5 tree model, with a score of 4, ranked 1st, and the 
RF model, with a score of 20, ranked last. 


In terms of accuracy, according to Table 7, the ANN model, with a score of 5, ranked 1st, and the 
MARS model, with a score of 7, ranked 2nd. 


Taking both time and accuracy into account, the ANN model has the lowest total score (15) and is 
chosen as the top model. The MARS (17), M5 (23), LSSVM (28), and RF (37) models rank next. 
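
The combined ranking can be reproduced by summing the "Total" rows of Table 6 and Table 7 for each model. The short Python sketch below does this. The dictionary and function names are illustrative, and it assumes that time and accuracy are weighted equally, which is what the reported ANN total of 15 (10 + 5) implies.

```python
# Rank totals transcribed from the "Total" rows of Table 6 (time) and
# Table 7 (accuracy). A lower score means a better rank.
TIME_TOTALS = {"M5": 4, "MARS": 10, "LSSVM": 16, "ANN": 10, "RF": 20}
ACCURACY_TOTALS = {"M5": 19, "MARS": 7, "LSSVM": 12, "ANN": 5, "RF": 17}


def combined_ranking(time_totals, accuracy_totals):
    """Sum the two rank totals per model and sort from best to worst."""
    combined = {
        model: time_totals[model] + accuracy_totals[model]
        for model in time_totals
    }
    return sorted(combined.items(), key=lambda item: item[1])


ranking = combined_ranking(TIME_TOTALS, ACCURACY_TOTALS)
for model, score in ranking:
    print(f"{model}: {score}")
```

Equal weighting is the simplest choice. A study that values accuracy more than run time could multiply the accuracy totals by a weight before summing.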


Figures 14-17 show the computed and observed values of ET₀ by the ANN and M5 models in 
Semnan province. 


Finally, to provide a comprehensive model for the arid climate, the data of the 4 synoptic stations 
of Semnan province were pooled and modeled with the best model (ANN). The results are 
presented in Table 8. 

Fig. 16. Computed and observed values of ET₀ data by M5 model in Semnan climate for the training period. 

Fig. 17. Computed and observed values of ET₀ data by M5 tree model in Semnan climate for the testing period. 


Table 8 
Comprehensive ET₀ estimation model in arid climate. 


Model | R² | CV of R² | MAE (mm/day) | CV of MAE | MSE (mm/day)² | CV of MSE | RMSE (mm/day) | CV of RMSE | Time (s) 
ANN model in Semnan climate | 0.985 | 0.0028 | 0.0238 | 0.807 | 0.1292 | 0.1782 | 0.3582 | 0.0872 | 2.9515 


3.4. Results of empirical models in estimating ET₀ 


3.4.1. Meteorological stations of Mazandaran province 


In this section, the results are presented for all datasets. At Sari station (Table 9), the 
Jensen-Haise method has the highest determination coefficient (0.9843). It also has the lowest 
MAE, MSE, and RMSE values (0.8764, 1.3582, and 1.1654, respectively). The Turc method has the 
lowest determination coefficient (0.9089) and the highest MAE, MSE, and RMSE values 
(2.0398, 5.523, and 2.3501, respectively). 


Table 9 
Statistics of the empirical models for Sari station. 
Method | R² | MAE (mm/day) | MSE (mm/day)² | RMSE (mm/day) 
Hargreaves-Samani | 0.9571 | 1.3049 | 2.6099 | 1.6155 
Turc | 0.9089 | 2.0398 | 5.523 | 2.3501 
Jensen-Haise | 0.9843 | 0.8764 | 1.3582 | 1.1654 
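
The statistics in Table 9 compare each empirical estimate with the FAO Penman-Monteith reference. The Python sketch below shows one way to compute them. The function name and the sample values are hypothetical. R² is computed here as the squared Pearson correlation, which is an assumption; the paper's own definition of the determination coefficient may differ.

```python
import math


def error_metrics(observed, estimated):
    """Return R2, MAE, MSE and RMSE for two equal-length series."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    mae = sum(abs(d) for d in errors) / n
    mse = sum(d * d for d in errors) / n
    rmse = math.sqrt(mse)

    # Squared Pearson correlation (assumed definition of R2).
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    r2 = cov * cov / (var_o * var_e)
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse}


# Hypothetical daily ET0 values in mm/day, for illustration only.
pm_reference = [2.0, 3.0, 4.0, 5.0]
empirical = [2.5, 3.5, 4.5, 5.5]  # constant bias of +0.5 mm/day
stats = error_metrics(pm_reference, empirical)
print(stats)
```

In this example the estimate is perfectly correlated with the reference but biased, so R² is 1 while MAE and RMSE are 0.5 mm/day. This is why a method can combine a high determination coefficient with errors above 1 mm/day, as in Table 9. As a consistency check, each RMSE in Table 9 is the square root of the corresponding MSE (for example, √1.3582 ≈ 1.1654).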


At Babolsar station, the Jensen-Haise method has the highest determination coefficient (0.9897). 
The Hargreaves-Samani method has the lowest MAE, MSE, and RMSE values (0.8334, 1.005, 
and 1.0025, respectively). The Turc method has the lowest determination coefficient (0.8907) and 
the highest MAE, MSE, and RMSE values (2.947, 5.9462, and 2.4385, respectively). 


At Dasht-e-Naz station, the Jensen-Haise method has the highest determination coefficient 
(0.9818). It also has the lowest MAE, MSE, and RMSE values of 0.8295, 1.2084, and 1.0993, 
respectively. The Turc method has the lowest determination coefficient (0.887), and the highest 
error values (2.2157, 6.503, and 2.5501, respectively). 


At Ghaemshahr station, the Jensen-Haise method has the highest determination coefficient 
(0.9702). It also has the lowest MAE, MSE, and RMSE values (0.8054, 1.1236, and 1.06, 
respectively). The Turc method has the highest MAE, MSE, and RMSE values (2.0434, 5.6115, and 
2.3689, respectively). The Hargreaves-Samani method has the lowest determination coefficient 
(0.7121). 


Based on the above results, the Jensen-Haise method has the highest determination coefficient 
(R²) at all stations and the lowest error values at all stations except Babolsar, where the 
Hargreaves-Samani method has the lowest errors. The Turc method has the lowest determination 
coefficient, except at the Ghaemshahr station, and recorded the highest error values at all 
stations. Thus, it can be concluded that the Jensen-Haise method is the best method for 
estimating ET₀ in the humid climate of Mazandaran. The Hargreaves-Samani method ranks 
second, and the Turc method ranks third. 


3.4.2. Meteorological stations of Semnan province 


At the Semnan station (Table 10), the Jensen-Haise method has the highest determination 
coefficient (0.9429). It also has the lowest MAE, MSE, and RMSE values (1.2959, 3.089, and 
1.7576, respectively). The Turc method has the highest error values (2.8954, 11.3384, and 3.3673, 
respectively) and the lowest determination coefficient (0.6695). 


Table 10 
Statistics of the empirical models for Semnan station. 
Method | R² | MAE (mm/day) | MSE (mm/day)² | RMSE (mm/day) 
Hargreaves-Samani | 0.8613 | 1.9572 | 5.5159 | 2.3486 
Turc | 0.6695 | 2.8954 | 11.3384 | 3.3673 
Jensen-Haise | 0.9429 | 1.2995 | 3.089 | 1.7576 

For the Damghan synoptic station, the Jensen-Haise method has the highest determination 
coefficient (0.968). It also has the lowest MAE, MSE, and RMSE values (0.8396, 1.1506, and 
1.0727, respectively). The highest error values (3.3673, 15.0858, and 3.8841, respectively) and 
the lowest determination coefficient (0.907) belong to the Turc method. 


The Hargreaves-Samani, Turc, and Jensen-Haise empirical methods are also investigated for the 
Garmsar synoptic station. According to the results, the Jensen-Haise method has the highest 
determination coefficient (0.9347). It also has the lowest MAE, MSE, and RMSE values (1.1027, 
2.2322, and 1.4941, respectively). The highest error values (3.2794, 14.6375, and 3.8259, 
respectively) and the lowest determination coefficient (0.8955) belong to the Turc method. 


Finally, for the Shahrood synoptic station, the Hargreaves-Samani method has the highest 
determination coefficient (0.9662). The Jensen-Haise method has the lowest MAE, MSE, and 
RMSE values (1.4531, 3.7023, and 1.9241, respectively). The highest error values (2.8087, 
10.1492, and 3.1858, respectively) and the lowest determination coefficient (0.8793) belong to 
the Turc method. 


Based on the above results, the Jensen-Haise method obtained the highest determination 
coefficient and the lowest error values at all stations, except for the determination coefficient 
at the Shahrood station (where it differs very little from the Hargreaves-Samani method). The 
Turc method has the lowest determination coefficient among all the methods at all stations and 
also recorded the highest MAE, MSE, and RMSE values. It can therefore be concluded that the 
Jensen-Haise method is the best method for estimating ET₀ in the dry climate of Semnan 
province. The Hargreaves-Samani method ranked second, and the Turc method ranked third. 


A reasonable conclusion from this research is that the Jensen-Haise method is the superior 
empirical method in both climates because of its high determination coefficient and low error 
values. In general, the results of this study, compared with other studies elsewhere, show the 
strength and ability of the proposed models in estimating the reference crop evapotranspiration. 
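
For orientation, the sketch below evaluates the Hargreaves-Samani equation in its commonly cited FAO-56 form, ET₀ = 0.0023 · Ra · (Tmean + 17.8) · (Tmax − Tmin)^0.5, with the extraterrestrial radiation Ra expressed in mm/day of equivalent evaporation. The function name and the sample inputs are hypothetical. The formulation used in this study may differ in detail, for example in the units of Ra, and the Turc and Jensen-Haise equations are not shown.

```python
import math


def hargreaves_samani_et0(t_max, t_min, ra_mm_per_day):
    """Estimate ET0 (mm/day) from daily air temperatures and Ra.

    t_max, t_min   -- daily maximum and minimum air temperature (deg C)
    ra_mm_per_day  -- extraterrestrial radiation as equivalent
                      evaporation (mm/day)
    """
    if t_max < t_min:
        raise ValueError("t_max must not be lower than t_min")
    t_mean = (t_max + t_min) / 2.0
    return 0.0023 * ra_mm_per_day * (t_mean + 17.8) * math.sqrt(t_max - t_min)


# Hypothetical summer day, for illustration only.
et0 = hargreaves_samani_et0(t_max=30.0, t_min=15.0, ra_mm_per_day=15.0)
print(round(et0, 2))
```

The equation needs only temperature and a radiation term that depends on latitude and day of year, which is why such empirical methods remain attractive where full meteorological records are unavailable.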


4. Discussion 


The main difficulties in modeling with data-mining methods were the quality of the input and 
target data, the selection of machine-learning parameters, and calibration with deficient data. 
To address these, the input data were pre-processed by normalization. The parameters of the 
data-mining methods were selected by sensitivity analysis and from the experience of the 
authors. A random calibration method was used for training and testing the data-mining methods 
to enhance accuracy. Sensitivity analysis was also used to select the most necessary inputs for 
modeling ET₀. Another main challenge in the present study was external disturbances, modeling 
errors, and uncertainties. To deal with these, each model was run 20 times, and the inputs in 
each run were generated by the bootstrapping method. In each run, the accuracy criteria were 
calculated, and their coefficients of variation over all runs were estimated and reported. A 
lower coefficient of variation indicates lower uncertainty, and running the data-mining methods 
multiple times with bootstrapped inputs reduced the impact of uncertainty. 
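
The repeated-run procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a straight-line least-squares fit stands in for the data-mining models, the data are synthetic, and each run is scored on the full series rather than on a separate test split. All function names are hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(observed, predicted):
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(statistics.fmean(squared))


def bootstrap_runs(xs, ys, n_runs=20, seed=0):
    """Refit the stand-in model on n_runs bootstrap samples; return RMSEs."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_runs):
        # Draw n indices with replacement (one bootstrap sample).
        idx = rng.choices(range(n), k=n)
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions = [slope * x + intercept for x in xs]
        scores.append(rmse(ys, predictions))
    return scores


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


# Synthetic input/target series, for illustration only.
data_rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [0.5 * x + 1.0 + data_rng.gauss(0.0, 0.3) for x in xs]

scores = bootstrap_runs(xs, ys)
cv_rmse = coefficient_of_variation(scores)
print(f"mean RMSE = {statistics.fmean(scores):.4f}, CV of RMSE = {cv_rmse:.4f}")
```

The mean over the 20 runs corresponds to the average criteria reported in the tables, and the coefficient of variation measures how sensitive each criterion is to the resampled inputs.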

The good accuracy of ANN in modeling ET₀ is reported in studies [54-56]. According to the 
results above, MARS and ANN gave the best results in the humid and dry climates, respectively, 
when the accuracy criteria, their coefficients of variation, and the computation times are all 
considered. ANN can be recommended because it processes data in multiple layers and uses an 
appropriate activation function with backpropagation as the learning algorithm, which gives 
good accuracy with low computation time. Other algorithms, such as LSSVM and RF, are not 
recommended because of their significantly longer computation times, despite their competitive 
accuracy. The longer computation time of LSSVM comes from calculating kernel functions and 
from the trial and error needed to select their parameters. 


Moreover, the high computation time of RF is due to generating several random samples and 
fitting one decision tree to each sample. The excellent accuracy of the MARS algorithm could be 
due to its divide-and-conquer strategy: the input set is divided into multiple subsets, and one 
spline regression is fitted to each subset. This helps MARS capture the nonlinear relations 
between inputs and outputs with good accuracy. It is worth mentioning that the different results 
of the data-mining methods in the two climates could be due to the different statistical 
characteristics of the input and target data at the different stations. For example, the variation 
of ET₀ is greater in the dry climate than in the humid climate, which leads to more accurate 
modeling of ET₀ in the humid climate. The better accuracy of data-mining methods compared 
with empirical equations can be attributed to their black-box approach: they are trained on the 
input and output data of the investigated regions, which leads to more accurate modeling. 
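
The divide-and-conquer idea behind MARS can be illustrated with a single knot and the mirrored pair of hinge functions max(0, x − t) and max(0, t − x), which together give a different linear segment on each side of the knot. The Python sketch below fits such a model by least squares. It is a simplified illustration with a fixed, hand-chosen knot and synthetic data; a full MARS implementation also searches for knots and prunes basis functions, which is not shown. All function names are hypothetical.

```python
def hinge(value):
    return max(0.0, value)


def design_row(x, knot):
    # Intercept plus the mirrored hinge pair around the knot.
    return [1.0, hinge(x - knot), hinge(knot - x)]


def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def fit_hinge_model(xs, ys, knot):
    """Least-squares weights for the hinge basis (normal equations)."""
    rows = [design_row(x, knot) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, x, knot):
    return sum(w * b for w, b in zip(weights, design_row(x, knot)))


# Synthetic target: slope 1 below x = 5 and slope -0.5 above it.
KNOT = 5.0
xs = [i * 0.5 for i in range(21)]
ys = [x if x <= KNOT else KNOT - 0.5 * (x - KNOT) for x in xs]

weights = fit_hinge_model(xs, ys, KNOT)
print([round(w, 3) for w in weights])
print(round(predict(weights, 2.0, KNOT), 3), round(predict(weights, 8.0, KNOT), 3))
```

A single straight line cannot follow this change of slope, whereas the two hinge terms recover both segments. This is the sense in which MARS handles nonlinear input-output relations.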


The competitive results of MARS are similar to those of the studies by Mehdizadeh (2018) and 
Shan et al. (2020). In addition, the present study indicated better accuracy of data-mining 
methods than empirical equations, as also reported by many studies such as Mehdizadeh et 
al. (2017) and Martin et al. (2021). 


5. Conclusions 


Reference crop evapotranspiration (ET₀) is a variable used in irrigation planning, water resources 
management, and hydrological studies. It is also used to estimate crop water requirements 
in large irrigation areas. This study used five data-mining methods: ANN, M5 decision tree, 
MARS, LS-SVM, and RF. The established empirical methods of Turc, Jensen-Haise, and 
Hargreaves-Samani were also applied. The well-known FAO Penman-Monteith method was 
used as the basis for comparing the three empirical models. A total of 8 
synoptic stations in the humid and dry climates (Mazandaran and Semnan provinces, 
respectively) were considered. Mazandaran synoptic stations included Sari, Ghaemshahr, 
Babolsar, and Dasht-e Naz, and Semnan synoptic stations included Semnan, Shahrood, 
Damghan, and Garmsar. The results revealed that in the humid climate of Mazandaran, 
the Jensen-Haise method was the best empirical model. The MARS model ranked first among 
the data-mining methods in terms of accuracy, and the LS-SVM, ANN, RF, and M5 tree models 
ranked second to fifth. If the computation time is the criterion, the M5 tree model ranks first, 
and the MARS, ANN, LS-SVM, and RF models rank second to fifth. If both accuracy and 
computation time are considered, the MARS model ranks first, and the ANN, LS-SVM, M5 tree, 
and RF models are in second to fifth place. Finally, the MARS model and the Jensen-Haise 
method are selected as the top choices in the humid climate of Mazandaran province. 


In the dry climate of Semnan province, the Jensen-Haise method was the best of the 
empirical equations. Among the data-mining models, ANN ranks first in terms of accuracy, and 
MARS, LS-SVM, RF, and M5 tree rank second to fifth. In terms of computation time, the M5 tree 
model ranks first, and the MARS, ANN, LS-SVM, and RF models rank second to fifth. 
However, if both accuracy and computation time are considered, the ANN model ranks 
first. Finally, the ANN model and the Jensen-Haise method were selected as the best models in 
the dry climate of Semnan province. The better accuracy of MARS is attributed to its 
divide-and-conquer strategy, which fits one spline to each subset of the original dataset, and 
the excellent accuracy of ANN to its processing of data in multiple layers. The different results 
of modeling ET₀ in each climate are due to the different statistical characteristics of the input 
and target time series. 


Data-mining methods have high potential for solving various civil engineering problems, 
provided that the data are of suitable length and quality and that an appropriate pre-processing 
method and calibration are chosen. However, it is necessary to consider accuracy, uncertainty, 
and computation time when selecting these algorithms. Furthermore, it is suggested that future 
studies enhance the accuracy of MARS by selecting its parameters using sensitivity analysis or 
optimization algorithms. 